Software Fault Tolerance in a Clustered Architecture: Techniques and Reliability Modeling

نویسندگان

  • Michael R. Lyu
  • Veena B. Mendiratta
چکیده

System architectures based on a cluster of computers have gained substantial attention recently. In a clustered system, complex software-intensive applications can be built with commercial hardware, operating systems, and application software to achieve high system availability and data integrity, while performance and cost penalties are greatly reduced by the use of separate error detection hardware and dedicated software fault tolerance routines. Within such a system a watchdog provides mechanisms for error detection and switch-over to a spare or backup processor in the presence of processor failures. The application software is responsible for the extent of the error detection, subsequent recovery actions and data backup. The application can be made as reliable as the user requires, being constrained only by the upper bounds on reliability imposed by the clustered architecture under various implementation schemes. We present reliability modeling and analysis of the clustered system by defining the hardware, operating system, and application software reliability techniques that need to be implemented to achieve different levels of reliability and comparable degrees of data consistency. We describe these reliability levels in terms of fault detection, fault recovery, volatile data consistency, and persistent data consistency, and develop a Markov reliability model to capture these fault detection and recovery activities. We also demonstrate how this cost-effective fault tolerant technique can provide quantitative reliability improvement within applications using clustered architectures.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Integrated Approach to Achieving High Software Reliability

In this paper we address the development, testing, and evaluation schemes for software reliability, and the integration of these schemes into a unified and consistent paradigm. Specifically, techniques and tools for the three phases of software reliability engineering will be described. The three phases are (1) modeling and analysis, (2) design and implementation, and (3) testing and measuremen...

متن کامل

Coverage-based testing strategies and reliability modeling for fault-tolerant software systems

Software permeates our modern society, and its complexity and criticality is ever increasing. Thus the capability to tolerate software faults, particularly for critical applications, is evident. While fault-tolerant software is seen as a necessity, it also remains as a controversial technique and there is a lack of conclusive assessment about its effectiveness. This thesis aims at providing a q...

متن کامل

System-Level Reliability and Sensitivity Analyses for Three Fault-Tolerant System Architectures

This paper discusses the modeling and analysis of three major fault-tolerant software system architec-tures: DRB (Distributed Recovery Blocks), NVP (N-Version Programming) and NSCP (N Self-Checking Programming). In the system-level reliability modeling domain, fault tree analysis techniques and Markov reward modeling techniques are combined to incorporate transient and permanent hardware faults...

متن کامل

Investigation on Reliability Estimation of Loosely Coupled Software as a Service Execution Using Clustered and Non-Clustered Web Server

Evaluating the reliability of loosely coupled Software as a Service through the paradigm of a cluster-based and non-cluster-based web server is considered to be an important attribute for the service delivery and execution. We proposed a novel method for measuring the reliability of Software as a Service execution through load testing. The fault count of the model against the stresses of users ...

متن کامل

Techniques for Modeling the Reliability of Fault-Tolerant Systems With the Markov State-Space Approach

This paper presents a step-by-step tutorial of the methods and the tools that were used for the reliability analysis of fault-tolerant systems. The approach of this paper is the Markov (or semi-Markov) state-space method. The paper is intended for design engineers with a basic understanding of computer architecture and fault tolerance, but little knowledge of reliability modeling. The represent...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999